
Conversation

@forsyth2 (Collaborator)

@forsyth2 forsyth2 commented Jul 11, 2025

Add ESGF links and more simulations to v1 data

@forsyth2 forsyth2 self-assigned this Jul 11, 2025
@forsyth2 forsyth2 marked this pull request as ready for review July 11, 2025 20:08
@forsyth2 (Collaborator Author)

@chengzhuzhang This is ready for review. I added the ESGF links that had data available. Web rendering can be seen at https://portal.nersc.gov/cfs/e3sm/forsyth/data_docs_60/html/v1/WaterCycle/simulation_data/simulation_table.html

@forsyth2 forsyth2 requested a review from chengzhuzhang July 11, 2025 20:09
@forsyth2 forsyth2 mentioned this pull request Jul 11, 2025
@forsyth2 forsyth2 changed the title Add ESGF links for v1 data Add ESGF links and more simulations to v1 data Jul 11, 2025
@forsyth2 forsyth2 (Collaborator Author) left a comment

@chengzhuzhang I added df0cfdb to begin the work of adding the large ensemble, but there's still a bit more to do on that, as described in this self-review.

Results from this commit can be seen at https://portal.nersc.gov/cfs/e3sm/forsyth/data_docs_60_try2/html/v1/WaterCycle/simulation_data/simulation_table.html.

@@ -0,0 +1,18 @@
# This will be a problem if these simulations are ever removed from the publication archives!
for i in $(seq 1 20); do
hsi ln -s /home/projects/e3sm/www/publication-archives/pub_archive_E3SM_1_0_LE_historical_ens$i /home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens$i
forsyth2 (Collaborator Author):

HSI/HPSS appends an @ to the end of its symlinks, but that may just be a visual indicator. In any case, HPSS paths and data sizes aren't being displayed on https://portal.nersc.gov/cfs/e3sm/forsyth/data_docs_60_try2/html/v1/WaterCycle/simulation_data/simulation_table.html

forsyth2 (Collaborator Author):

Some of the other data sets are showing a size of 0, but these don't show a size at all, which makes me think the path isn't being found.

That said, they do show up in my output logs:

1	/home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens11
-----------------------
0	total 512-byte blocks, 0 Files (0 bytes)

So, it seems to read it as an empty path. I wonder if symlinks show zero size?

This one shows up as 0 in the table:

341850452	2	/home/projects/e3sm/www/WaterCycle/E3SMv1/HR/cori-haswell.20190513.F2010LRtunedHR.plus4K.noCNT.ne30_oECv3/
-----------------------
341850452	total 512-byte blocks, 2 Files (175,027,431,424 bytes)

So, it must be that 175x10^9 bytes rounds down to 0 TB (0.175x10^12 bytes). Indeed, this 113x10^12-byte run shows up as 113:

221651622324	820	/home/projects/e3sm/www/WaterCycle/E3SMv1/HR/20211021-maint-1.0-tro.A_WCYCLSSP585_CMIP6_HR.ne120_oRRS18v3_ICG.unc12-3rd-attempt/
-----------------------
221651622324	total 512-byte blocks, 820 Files (113,485,630,629,888 bytes)
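The rounding described above can be reproduced with a quick sketch (a hypothetical helper, assuming the table code converts bytes to whole terabytes with integer division):

```python
def bytes_to_tb(num_bytes: int) -> int:
    # Integer terabytes: anything under 10^12 bytes rounds down to 0.
    return num_bytes // 10**12

# The HR run above: 175,027,431,424 bytes displays as 0 TB.
assert bytes_to_tb(175_027_431_424) == 0
# The 113x10^12-byte run displays as 113.
assert bytes_to_tb(113_485_630_629_888) == 113
```

If this is the cause, formatting small data sets in GB (or as a float) would make the 0-TB rows readable.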

Collaborator:

@forsyth2 could you double-check the file size from /home/projects/e3sm/www/publication-archives/pub_archive_E3SM_1_0_LE_historical_ens$i? Hopefully there was no corruption during the zstash archive or transfer.

forsyth2 (Collaborator Author):

@chengzhuzhang It's definitely an issue with the symlinks; I'm discussing with NERSC support. The original paths are fine, e.g.:

hsi du /home/projects/e3sm/www/publication-archives/pub_archive_E3SM_1_0_LE_historical_ens1
# 49970007900	95	/home/projects/e3sm/www/publication-archives/pub_archive_E3SM_1_0_LE_historical_ens1/
# -----------------------
# 49970007900	total 512-byte blocks, 95 Files (25,584,644,044,800 bytes)
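For comparing sizes programmatically, the byte count can be pulled out of the `hsi du` summary line (a sketch; the regex assumes the output format shown above):

```python
import re

def parse_hsi_du_bytes(summary: str) -> int:
    """Extract the byte count from an `hsi du` summary line, e.g.
    '49970007900  total 512-byte blocks, 95 Files (25,584,644,044,800 bytes)'."""
    match = re.search(r"\(([\d,]+) bytes\)", summary)
    if match is None:
        raise ValueError(f"no byte count found in: {summary!r}")
    return int(match.group(1).replace(",", ""))

line = "49970007900  total 512-byte blocks, 95 Files (25,584,644,044,800 bytes)"
assert parse_hsi_du_bytes(line) == 25_584_644_044_800
```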

v1, WaterCycle, LR, DAMIP, 20190404.DECKv1b_H1_hist-GHG.ne30_oEC.edison, edison, , damip_hist-GHG, 1, none, ,
v1, WaterCycle, LR, DAMIP, 20190404.DECKv1b_H2_hist-GHG.ne30_oEC.edison, edison, , damip_hist-GHG, 2, none, ,
v1, WaterCycle, LR, DAMIP, 20190404.DECKv1b_H3_hist-GHG.ne30_oEC.edison, edison, , damip_hist-GHG, 3, none, ,
v1, WaterCycle, LR, LargeEnsemble, LE_historical_ens1, , , historical-large-ensemble, 1, none, ,
forsyth2 (Collaborator Author):

Is the large ensemble data available on ESGF? If so, what's the experiment name? I assume it's not historical-large-ensemble.

And actually, on that note, some of the other v1 data sets may be missing ESGF links simply because I guessed the experiment name wrong. (I'm not seeing a way to determine the experiment, or the ensemble number for that matter, from https://e3sm.atlassian.net/wiki/spaces/ED/pages/4495441922/V1+Simulation+backfill+WIP)

Collaborator:

Yes, the v1 large ensemble data are available on ESGF in CMIP format. The experiment and ensemble names can be found here: https://github.com/E3SM-Project/datasm/blob/master/datasm/resources/v1_LE_dataset_spec.yaml. @TonyB9000 I think you documented the mapping from the LE native ensemble index to the CMIP ensemble (e.g. r1i2p2f1), but I forget whether that was for v1 or v2. Could you help check?

Collaborator:

Will do.

Collaborator:

E3SM LE_archive_refactor.xlsb.xlsx

I think this was v1, since if it were v2 I would have had to distinguish them, but nothing in the naming indicated v1 or v2.

I'll keep poking around.

Collaborator:

The directory "/p/user_pub/e3sm/archive/External/" holds 5 related subdirectories:

E3SMv1_LE
E3SMv1_LE_ext
E3SMv1_LE_ssp370
E3SMv2_LE
E3SMv2_LE_ssp370

The E3SMv2_LE directory has a file I created called "Arch_Translator_E3SMv2_LE", which holds:

Ensemble,Archive,Branch_time_in_parent
ens6,v2.LR.historical_0111,40150.0
ens7,v2.LR.historical_0121,43800.0
ens8,v2.LR.historical_0131,47450.0
ens9,v2.LR.historical_0141,51100.0
ens10,v2.LR.historical_0161,58400.0
ens11,v2.LR.historical_0171,62050.0
ens12,v2.LR.historical_0181,65700.0
ens13,v2.LR.historical_0191,69350.0
ens14,v2.LR.historical_0211,76650.0
ens15,v2.LR.historical_0221,80300.0
ens16,v2.LR.historical_0231,83950.0
ens17,v2.LR.historical_0241,87600.0
ens18,v2.LR.historical_0261,94900.0
ens19,v2.LR.historical_0271,98550.0
ens20,v2.LR.historical_0281,102200.0
ens21,v2.LR.historical_0291,105850.0

(Ensembles 1-5 are missing because they were created independently of the LE, in the v2 historical.)

Likewise, E3SMv2_LE_ssp370/ holds a file named "Arch_Translator_E3SMv2_LE_ssp370", which contains:

Ensemble,Archive,Branch_time_in_parent
ens1,v2.LR.SSP370_0101,36500.0
ens6,v2.LR.SSP370_0111,40150.0
ens7,v2.LR.SSP370_0121,43800.0
ens8,v2.LR.SSP370_0131,47450.0
ens9,v2.LR.SSP370_0141,51100.0
ens2,v2.LR.SSP370_0151,54750.0
ens10,v2.LR.SSP370_0161,58400.0
ens11,v2.LR.SSP370_0171,62050.0
ens12,v2.LR.SSP370_0181,65700.0
ens13,v2.LR.SSP370_0191,69350.0
ens3,v2.LR.SSP370_0201,73000.0
ens14,v2.LR.SSP370_0211,76650.0
ens15,v2.LR.SSP370_0221,80300.0
ens16,v2.LR.SSP370_0231,83950.0
ens17,v2.LR.SSP370_0241,87600.0
ens4,v2.LR.SSP370_0251,91250.0
ens18,v2.LR.SSP370_0261,94900.0
ens19,v2.LR.SSP370_0271,98550.0
ens20,v2.LR.SSP370_0281,102200.0
ens21,v2.LR.SSP370_0291,105850.0
ens5,v2.LR.SSP370_0301,109500.0

I don't know how much that helps. Special functions were written that translate a given CMIP6 dataset_id to its corresponding E3SM "native" dataset_id. But for those functions to work (parent_native_dsid.sh, etc.), one must supply the alternate "Archive_Map" for the v1 or v2 LE, as these are not part of the E3SM "dataset_spec.yaml".

We can probably generate a "cmip-case" to "native-case" mapping file. Might take a day or so.
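For what it's worth, such a mapping could be bootstrapped from the Arch_Translator CSVs above (a sketch with a hypothetical helper name; only the three-column format is taken from the files shown):

```python
import csv
import io

# Two sample rows from Arch_Translator_E3SMv2_LE, quoted above.
ARCH_TRANSLATOR = """\
Ensemble,Archive,Branch_time_in_parent
ens6,v2.LR.historical_0111,40150.0
ens7,v2.LR.historical_0121,43800.0
"""

def ensemble_to_archive(text: str) -> dict:
    """Map each native ensemble index (e.g. 'ens6') to its archive case."""
    return {row["Ensemble"]: row["Archive"]
            for row in csv.DictReader(io.StringIO(text))}

mapping = ensemble_to_archive(ARCH_TRANSLATOR)
assert mapping["ens6"] == "v2.LR.historical_0111"
```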

@TonyB9000 TonyB9000 (Collaborator) commented Jul 15, 2025

The historical cases and ssp370 cases are independent.

Walking the tree for "Project: E3SM" a bit further, you will see a "cmip_case", and yes, there is a 1-to-1 mapping between each native "ens#" and the corresponding "Project: CMIP" case, as in

"(native) ens#" corresponds to "(CMIP6) r#" of the variant label ("realization index").

    E3SM:
        '1_0_LE':
            historical:
                start: 1850
                end: 2014
                ens:
                    - ens1
                    - ens2
                    - ens3
                    - ens4
                    - ens5
                    - ens6
                    - ens7
                    - ens8
                    - ens9
                    - ens10
                    - ens11
                    - ens12
                    - ens13
                    - ens14
                    - ens15
                    - ens16
                    - ens17
                    - ens18
                    - ens19
                    - ens20
                except:
                    - TREFMNAV
                    - TREFMXAV
                campaign: DECK-v1
                science_driver: Water Cycle
                cmip_case: CMIP6.CMIP.UCSB.E3SM-1-0.historical

If you put these lines into your (acme1) ~/.bashrc file:

export DSM_GETPATH=/p/user_pub/e3sm/staging/Relocation/.dsm_get_root_path.sh
alias list_e3sm="python /p/user_pub/e3sm/staging/tools/list_e3sm_dsids.py"
alias list_cmip="python /p/user_pub/e3sm/staging/tools/list_cmip6_dsids.py"

(and issue "source ~/.bashrc")

And then

1. git clone https://github.com/E3SM-Project/datasm.git
2. cd datasm
3. conda env create -n <env_name> -f conda-env/prod.yml
4. conda activate <env_name>
5. pip install .

Then your environment will have "datasm/util" and its functions available to any python, via "import datasm.util" or "from datasm.util import (selected functions)".

You can issue list_e3sm -d <path_to_the_dataset_spec> and generate ALL E3SM dataset_ids for that dataset_spec.

Likewise, use list_cmip -d <path_to_the_dataset_spec> to generate all corresponding CMIP6 dataset_ids.

These utilities will "walk" the respective YAML trees to express every branch. If no "-d dataset_spec" is given, the default dataset_spec.yaml (staging/resource/dataset_spec.yaml) is used.

Other than this, I'm not quite sure what you need. Perhaps I can generate stuff for you, if I understand what you are looking for.
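The "walk the YAML tree" behavior of list_e3sm/list_cmip can be sketched as follows (a toy spec mirroring the dataset_spec.yaml fragment above, not the actual datasm implementation):

```python
# Toy spec shaped like the dataset_spec.yaml fragment above.
spec = {
    "E3SM": {
        "1_0_LE": {
            "historical": {"start": 1850, "end": 2014,
                           "ens": ["ens1", "ens2", "ens3"]},
        },
    },
}

def list_dataset_ids(spec: dict) -> list:
    """Emit one dataset_id per (project, model, experiment, ensemble) branch."""
    ids = []
    for project, models in spec.items():
        for model, experiments in models.items():
            for experiment, info in experiments.items():
                for ens in info.get("ens", []):
                    ids.append(f"{project}.{model}.{experiment}.{ens}")
    return ids

assert list_dataset_ids(spec) == [
    "E3SM.1_0_LE.historical.ens1",
    "E3SM.1_0_LE.historical.ens2",
    "E3SM.1_0_LE.historical.ens3",
]
```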

forsyth2 (Collaborator Author):

@TonyB9000 I'm just trying to determine the URL to link to. That requires knowing the query parameter values. But upon further inspection, it appears ESGF links might not be available for the large ensemble, in which case it's a moot point.
[Screenshot: ESGF search page. Notice that none of the available experiment IDs suggest the large ensemble.]

forsyth2 (Collaborator Author):

Even clicking "historical", it's just the 5 ensemble members of the regular historical. Again, no indicator of the large ensemble.

[Screenshot: ESGF search page with "historical" selected.]

Collaborator:

The v1 LE is published under the UCSB institution ID; if you leave out the Institution ID filter, the large ensemble should pop up.

Collaborator:

True. But for v2 LE the project listed is "E3SM-Project" (21 ensembles). The CMIP6 datasets are not really distinguished as "LE" except that the variant labels range from r6 to r21 (16 ensembles). The native data is distinguished by Model = "2_0_LE". Likewise, the v1_LE native data has Model = "1_0_LE" (But native data is no longer available via ESGF/Metagrid.)

done

# Symlink last remaining large simulation
# This will be a problem if ndk ever deletes the source!
forsyth2 (Collaborator Author):

@chengzhuzhang I meant to include this in the self-review I just posted. The symlinks are fine as long as we are guaranteed that people don't delete the source directories like /home/projects/e3sm/www/publication-archives/ or /home/n/ndk/2019/theta.20190910.branch_noCNT.n825def.unc06.A_WCYCL1950S_CMIP6_HR.ne120_oRRS18v3_ICG. Is that something we can be sure of?

Collaborator:

I think so. Tagging directory owners @TonyB9000 and @ndkeen: please make sure not to delete the above directories.

@forsyth2 forsyth2 (Collaborator Author) left a comment

@chengzhuzhang @TonyB9000 Ok I've added the large ensemble & the existing ESGF links. See https://portal.nersc.gov/cfs/e3sm/forsyth/data_docs_60_try9/html/v1/WaterCycle/simulation_data/simulation_table.html for a rendered version of the web page. This is ready for final review.

[Screenshot: v1 data page]

@TonyB9000 I've noted symlinked HPSS paths with (symlink) ...hpss_path...; is that going to interfere with any automated data retrieval you do from these pages?

@chengzhuzhang (Collaborator)

@forsyth2 thanks for adding the v1 LE and the ESGF links. One note: for the simulation overview page, could you also add:

  1. the v1 LE
  2. the overview paper describing the v1 LE: Stevenson et al. 2023, https://doi.org/10.1029/2023MS003653

Thanks!

@forsyth2 (Collaborator Author)

@TonyB9000 (Collaborator)

@forsyth2 @chengzhuzhang
I've noted symlinked HPSS paths with (symlink) ...hpss_path...; is that going to interfere with any automated data retrieval you do from these pages?

Yes, it most certainly will. A column labeled "HPSS Path" should not be polluted with non-functional commentary. People need to understand that we use computers to automate. As nice as it is to have human-friendly material, such should be secondary to functional considerations.

Personally, I would have the default date-timestamp on ALL log-files be 8+ HEX chars (like "D85A33B2", representing Epoch-seconds). Very unfriendly to look at? Then pass it through a "prettifier" that converts the log entry to "2025-07-15 09:42:30", or if you like, "The Fifteenth Day of Our Lord, July 2025 AD, at the 9th hour, 42nd minute, and 30th second of the morning in the Pacific Standard Timezone".

Instead, I will need to munge code to toss out everything in the returned HPSS-Path that occurs before the first "/". For now, at least.
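The munging described above amounts to dropping everything before the first "/" in the cell (a sketch, not the actual retrieval code):

```python
def clean_hpss_path(cell: str) -> str:
    """Drop any annotation (e.g. '(symlink) ') before the first '/'."""
    idx = cell.find("/")
    return cell[idx:] if idx != -1 else cell

assert clean_hpss_path(
    "(symlink) /home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens1"
) == "/home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens1"
```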

@forsyth2 (Collaborator Author)

People need to understand that we use computers to automate. As nice as it is to have human-friendly material, such should be secondary to functional considerations.

This page is for humans, though. It almost seems like we should have some sort of output file meant for a computer to read, rather than having a program parse the information from HTML. As I noted in a previous email:

I think perhaps the most straightforward thing to do here is to modify "generate_tables" in https://github.com/E3SM-Project/e3sm_data_docs/blob/main/utils/generate_tables.py#L227 to produce not only the rst table but also an equivalent csv (or, better yet, construct the table from csv per #30). Then it's exactly the data you need, in the right format.

That is, I believe the fundamental issue here is that we're relying on HTML serving both computers & humans, when we should just be outputting computer-readable material elsewhere.
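The single-source idea could look roughly like this: render both the machine-readable CSV and the human-facing rst table from the same rows (a sketch, not the actual generate_tables.py code; the column names are illustrative):

```python
import csv
import io

HEADER = ["simulation", "machine", "hpss_path", "size_tb"]

def to_csv(rows: list) -> str:
    """Machine-readable output: plain CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buf.getvalue()

def to_rst(rows: list) -> str:
    """Human-facing output: an rst list-table built from the same rows."""
    lines = [".. list-table::", "   :header-rows: 1", ""]
    for row in [HEADER] + [[str(cell) for cell in row] for row in rows]:
        lines.append("   * - " + "\n     - ".join(row))
    return "\n".join(lines) + "\n"

rows = [["LE_historical_ens1", "", "/home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens1", "25"]]
assert to_csv(rows).splitlines()[0] == "simulation,machine,hpss_path,size_tb"
```

With both files generated from one row list, the HTML and the machine-readable copy can never drift apart.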

@forsyth2 (Collaborator Author)

@TonyB9000 if you provide me with an exact list of data you need from these tables, I should be able to easily produce that in a machine readable file.

@forsyth2 (Collaborator Author)

if you provide me with an exact list of data you need from these tables, I should be able to easily produce that in a machine readable file.

That would need to be part of a separate PR, though, as the work is distinct from adding the v1 data.
In the meantime, do you need all the columns clean, or can I add the "(symlink)" note to, say, the simulation name column?

@TonyB9000 (Collaborator)

@forsyth2

we're relying on HTML serving both computers & humans

Indeed. In fact, to avoid inconsistencies, the focus should be to produce the "machine-readable" version of materials, and then use that as the primary source for HTML creation and human-readable material, augmented with commentary, etc.

Machine ==> Human: Easy
Human ==> Machine: HARD.

@forsyth2 forsyth2 mentioned this pull request Jul 15, 2025
@forsyth2 (Collaborator Author)

@TonyB9000 Great, I prototyped a solution at #61. Can you please review #61 (review)? If you approve of that, I think I can go ahead and merge both PRs.

@TonyB9000 (Collaborator)

@forsyth2

do you need all the columns clean

I should clarify. At runtime, I consult my own "NERSC_Archive_Locator" file, whose entries are (e.g.):

LR:AMIP,v2.LR.amip_0101,2,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.amip_0101
LR:AMIP,v2.LR.amip_0201,2,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.amip_0201
LR:AMIP,v2.LR.amip_0301,2,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.amip_0301
LR:AMIP,v2.LR.amip_0101_bonus,2,na,na,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.amip_0101_bonus
LR:RFMIP,v2.LR.piClim-control,1,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-control
LR:RFMIP,v2.LR.piClim-histall_0021,3,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-histall_0021
LR:RFMIP,v2.LR.piClim-histall_0031,3,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-histall_0031
LR:RFMIP,v2.LR.piClim-histall_0041,3,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-histall_0041
LR:RFMIP,v2.LR.piClim-histaer_0021,3,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-histaer_0021
LR:RFMIP,v2.LR.piClim-histaer_0031,3,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-histaer_0031
LR:RFMIP,v2.LR.piClim-histaer_0041,3,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-histaer_0041
LR:Other,v2_ndgclim_t6h_1850aer,0,na,na,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2_ndgclim_t6h_1850aer
LR:Other,v2_ndgclim_t6h_2010aer,0,na,na,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2_ndgclim_t6h_2010aer
NARRM:DECK,v2.NARRM.piControl,80,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/NARRM/v2.NARRM.piControl
NARRM:DECK,v2.NARRM.abrupt-4xCO2_0101,24,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/NARRM/v2.NARRM.abrupt-4xCO2_0101

This was created by MANUALLY scraping the HTML data. Note that the hyperlinks are removed; I don't use the first column.

At runtime (thanks to having created a local "Archive_Map": paths to archives on Chrysalis AND zstash file-extraction patterns), if I don't have the data in the warehouse BUT it is listed in the local Archive_Map, I take the "basename" of the archive path (the case_id, like "DECK,v2.NARRM.piControl") and look it up in the NERSC_Archive_Locator (field 2). Where a match is found, I return fields 3 (Volume) and 6 (NERSC HPSS Archive Path).

I then (hope to) use "zstash --check" to pull over the archive in question.

Since I (presently) create the NERSC Archive_Locator manually, I simply edit out extraneous material.
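The lookup described above is a keyed scan of the locator file: match on field 2, return fields 3 and 6 (a sketch using two of the sample rows; the function name is hypothetical):

```python
# Two sample rows from the NERSC_Archive_Locator quoted above.
LOCATOR = """\
LR:AMIP,v2.LR.amip_0101,2,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.amip_0101
LR:RFMIP,v2.LR.piClim-control,1,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-control
"""

def locate(case_id: str):
    """Return (volume, hpss_path) for a case_id, or None if absent."""
    for line in LOCATOR.strip().splitlines():
        fields = line.split(",")
        if fields[1] == case_id:
            return fields[2], fields[5]  # field 3 (Volume), field 6 (HPSS path)
    return None

assert locate("v2.LR.piClim-control") == (
    "1", "/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-control")
```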

@forsyth2 (Collaborator Author)

I'm a little confused. If the current process is manual, then what's the problem with having "(symlink) " in the HPSS path cell?

In any case, #61 should pave the way to full automation.

@TonyB9000 TonyB9000 (Collaborator) commented Jul 15, 2025

@forsyth2 I guess "semi-manual", as I do use tools to strip formatting from the HTML copy. But yes, it is only a minor inconvenience. These things do add up (mapfile/region-file selection, user_metadata updates, etc.), so I am simply venting my frustration with the system overall. Each of these little (manual) things is:

  1. Something to forget to do until the last minute, when things fail
  2. An opportunity to mis-copy or otherwise screw up configuration

Hence, forcing automation not only eases the manual burden, it isolates decisions to a "fix-it-once-and-forget-it" regime of operation.

@forsyth2 forsyth2 merged commit 5c5dfc3 into main Jul 15, 2025
1 check passed
@forsyth2 forsyth2 deleted the v1-data-esgf branch July 15, 2025 23:48